Abstract



We consider string matching with variable length gaps. Given a string T and a pattern P 
consisting of strings separated by variable length gaps (arbitrary strings of length in a specified 
range), the problem is to find all ending positions of substrings in T that match P. This problem 
is a basic primitive in computational biology applications. Let m and n be the lengths of P 
and T, respectively, and let k be the number of strings in P. We present a new algorithm 
achieving time O(n log k + m + α) and space O(m + A), where A is the sum of the lower bounds
of the lengths of the gaps in P and α is the total number of occurrences of the strings in P
within T. Compared to the previous results this bound essentially achieves the best known time
and space complexities simultaneously. Consequently, our algorithm obtains the best known
bounds for almost all combinations of m, n, k, A, and α. Our algorithm is surprisingly simple
and straightforward to implement. We also present algorithms for finding and encoding the 
positions of all strings in P for every match of the pattern. 

1 Introduction 

Given integers a and b, 0 ≤ a ≤ b, a variable length gap g{a, b} is an arbitrary string over Σ of
length between a and b, both inclusive. A variable length gap pattern (abbreviated VLG pattern)
P is the concatenation of a sequence of strings and variable length gaps, that is, P is of the form

P = P_1 · g{a_1, b_1} · P_2 · g{a_2, b_2} ··· g{a_{k-1}, b_{k-1}} · P_k .

A VLG pattern P matches a substring S of T iff S = P_1 · G_1 ··· G_{k-1} · P_k, where G_i is any string
of length between a_i and b_i, i = 1, ..., k − 1. Given a string T and a VLG pattern P, the variable
length gap problem (VLG problem) is to find all ending positions of substrings in T that match P.

Example 1 As an example, consider the problem instance over the alphabet Σ = {A, G, C, T}:

T = ATCGGCTCCAGACCAGTACCCGTTCCGTGGT
P = A · g{6,7} · CC · g{2,6} · GT

The solution to the problem instance is the set of positions {17, 28, 31}. For example, the solution
contains 17, since the substring ATCGGCTCCAGACCAGT, ending at position 17 in T, matches
P.

*An extended abstract of this paper appeared in proceedings of the 17th Symposium on String Processing and
Information Retrieval.
Variable length gaps are frequently used in computational biology applications [7, 8, 14, 16, 17].
For instance, the PROSITE data base [5, 10] supports searching for proteins specified by VLG 
patterns. 
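
The problem statement can be made concrete with a brute-force sketch that follows the definition of a match directly. It is an illustration only and not the algorithm of this paper; the function name `brute_force_vlg` and the representation of P as a list of strings plus a list of (a, b) gap bounds are assumptions made for this example. Positions are 1-indexed, as in Example 1.

```python
def brute_force_vlg(text, strings, gaps):
    """Return the sorted 1-indexed end positions of substrings of text matching P.

    strings = [P_1, ..., P_k], gaps = [(a_1, b_1), ..., (a_{k-1}, b_{k-1})].
    Exponential in the worst case; meant only to mirror the definition.
    """
    n = len(text)
    ends = set()

    def extend(i, pos):
        # strings[0..i-1] are already matched; strings[i] must start at the
        # 0-indexed position pos.
        if not text.startswith(strings[i], pos):
            return
        after = pos + len(strings[i])  # 0-indexed exclusive end == 1-indexed end
        if i == len(strings) - 1:
            ends.add(after)
            return
        a, b = gaps[i]
        for gap_length in range(a, b + 1):
            if after + gap_length <= n:
                extend(i + 1, after + gap_length)

    for start in range(n):
        extend(0, start)
    return sorted(ends)


T = "ATCGGCTCCAGACCAGTACCCGTTCCGTGGT"
print(brute_force_vlg(T, ["A", "CC", "GT"], [(6, 7), (2, 6)]))
```

On the instance of Example 1 this prints the positions 17, 28 and 31 stated there.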

1.1 Previous Work 

We briefly review the main worst-case bounds for the VLG problem. As above, let P = P_1 ·
g{a_1, b_1} · P_2 · g{a_2, b_2} ··· g{a_{k-1}, b_{k-1}} · P_k be a VLG pattern consisting of k strings, and let T be
a string. To state the bounds, let m = Σ_{i=1}^{k} |P_i| be the sum of the lengths of the strings in P and
let n be the length of T.

The simplest approach to solve the VLG problem is to translate P into a regular expression
and then use an algorithm for regular expression matching. Unfortunately, the translation produces
a regular expression significantly longer than P, resulting in an inefficient algorithm. Specifically,
suppose that the alphabet Σ contains σ characters, that is, Σ = {c_1, ..., c_σ}. Using standard regular
expression operators (union and concatenation), we can translate g{a, b} into the expression

$$g\{a,b\} = \underbrace{C \cdots C}_{a} \; \underbrace{(C \mid \epsilon) \cdots (C \mid \epsilon)}_{b-a},$$

where C is shorthand for the expression (c_1 | c_2 | ... | c_σ). Hence, a variable length gap g{a, b},
represented by a constant length expression in P, is translated into a regular expression of length
Ω(σb). Consequently, a regular expression R corresponding to P has length Ω(Bσ + m), where
B = Σ_{i=1}^{k-1} b_i is the sum of the upper bounds of the gaps in P. Using Thompson's textbook
regular expression matching algorithm [20] this leads to an algorithm for the VLG problem using
O(n(Bσ + m)) time. Even with the fastest known algorithms for regular expression matching this
bound can only be improved by at most a polylogarithmic factor [2, 3, 15, 18].

Several algorithms that improve upon the direct translation to a regular expression matching
problem have been proposed [4, 6-8, 12-14, 16, 17, 19]. Some of these are able to solve more general
versions of the problem, such as searching for patterns that also contain character classes and
variable length gaps with negative length. Most of the algorithms are based on fast simulations
of non-deterministic finite automata. In particular, Navarro and Raffinot [17] gave an algorithm
using O(n((m + B)/w + 1)) time, where w is the number of bits in a memory word. Fredriksson and
Grabowski [7, 8] improved this bound for the case when all variable length gaps have lower bound
0 and identical upper bound b. Their fastest algorithm achieves O(n((m log log b)/w + 1)) time. Very
recently, Bille and Thorup [4] gave an algorithm using O(n(k (log w)/w + log k) + m log m + A) time and
O(m + A) space, where A = Σ_{i=1}^{k-1} a_i is the sum of the lower bounds on the lengths of the gaps.
Note that if we assume that the nk term dominates and ignore the w/log w factor, the time bound
reduces to O(nk).

An alternative approach, suggested independently by Morgante et al. [13] and Rahman et
al. [19], is to design algorithms that are efficient in terms of the total number of occurrences of the
k strings P_1, ..., P_k within T. Let α be this number, e.g., in Example 1 A, CC, and GT occur
5, 5, and 4 times in T. Hence, α = 5 + 5 + 4 = 14. Rahman et al. [19] gave an algorithm using
O(n log k + m + α log(max_{1≤i<k}(b_i − a_i))) time¹. Morgante et al. [13] gave a faster algorithm using
O(n log k + m + α) time. Each of the k strings in P can occur at most n times and therefore α ≤ nk.
Hence, in the typical case when the strings occur less frequently, i.e., α = o(n(k (log w)/w + log k)), these
approaches are faster. However, unlike the automata based algorithms that only use O(m + A)
space, both of these algorithms use Θ(m + α) space. Since α typically increases with the length of
T, the space usage of these algorithms is likely to quickly become a bottleneck for processing large
biological data bases.

¹The bound stated in the paper does not include the log k factor, since they assume that the size of the alphabet
is constant. We make no assumption on the alphabet size and therefore include it here.
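
Returning to the direct translation to a regular expression described at the start of this section, the following sketch builds the expanded expression using only union and concatenation, so that its length grows with σb as stated. It uses Python's standard `re` module as a stand-in matcher (a backtracking engine, not Thompson's algorithm); the helper names `gap_regex`, `vlg_regex` and `regex_end_positions` are assumptions for this illustration.

```python
import re


def gap_regex(a, b, alphabet):
    """Expand g{a,b} as C...C (a times) followed by (C|eps)...(C|eps) (b - a times)."""
    union = "|".join(re.escape(ch) for ch in alphabet)
    c = "(?:" + union + ")"            # C = (c_1 | ... | c_sigma)
    c_or_empty = "(?:" + union + "|)"  # (C | eps), written with an empty alternative
    return c * a + c_or_empty * (b - a)


def vlg_regex(strings, gaps, alphabet):
    """Concatenate the strings of P with the expanded gaps between them."""
    parts = [re.escape(strings[0])]
    for (a, b), s in zip(gaps, strings[1:]):
        parts.append(gap_regex(a, b, alphabet))
        parts.append(re.escape(s))
    return "".join(parts)


def regex_end_positions(text, regex):
    """1-indexed positions e such that some substring ending at e matches regex."""
    anchored = re.compile("(?:" + regex + r")\Z")
    return [e for e in range(1, len(text) + 1) if anchored.search(text[:e])]


T = "ATCGGCTCCAGACCAGTACCCGTTCCGTGGT"
R = vlg_regex(["A", "CC", "GT"], [(6, 7), (2, 6)], "ACGT")
print(len(R), regex_end_positions(T, R))
```

The expanded expression is already far longer than the five-symbol pattern of Example 1, which is the blow-up the Ω(Bσ + m) bound describes.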

1.2 Our Results 

We address the basic question of whether it is possible to design an algorithm that simultaneously is
fast in the total number of occurrences of the k strings and uses little space. We show the following
result.

Theorem 1 Given a string T and a VLG pattern P with k strings, we can solve the variable length
gaps matching problem in time O(n log k + m + α) and space O(m + A). Here, α is the number of
occurrences of the strings of P in T and A is the sum of the lower bounds of the gaps.

Hence, we match the best known time bounds in terms of α and the space for the fastest automata
based approach. Consequently, whenever α = o(n(k (log w)/w + log k)) the time and space bounds of
Theorem 1 are the best known. Our algorithm uses a standard comparison based version of the
Aho-Corasick automaton for multi-string matching [1]. If the size of the alphabet is constant or
we use hashing, the log k factor in the running time disappears. Furthermore, our algorithm is
surprisingly simple and straightforward to implement.

In some cases, we may also be interested in outputting not only the ending positions of matches
of P, but also the positions of the individual strings in P for each match of P in T. Note that
there can be exponentially many, i.e., Ω(∏_{i=1}^{k-1} (b_i − a_i + 1)), of these occurrences ending at the same
position in T. Morgante et al. [13] showed how to encode all of these in a graph of size O(α). We
show how our algorithm can be extended to efficiently output such a graph. Furthermore, we show
two solutions for outputting the positions encoded in the graph. Both solutions use little space
since they avoid the need to store the entire graph. The first solution is a black-box solution that
works with any algorithm for constructing the graph. The second is a direct approach obtained
using a simple extension of our algorithm.

Recently, Haapasalo et al. [9] studied practical algorithms for an extension of the VLG problem 
that allows multiple patterns and gaps with unbounded upper bounds. We note that the result of 
Thm. 1 is straightforward to generalize to this case. 

1.3 Technical Overview 

The previous work by Morgante et al. [13] and Rahman et al. [19] finds all of the α occurrences
of the strings P_1, ..., P_k of P in T using a standard multi-string matching algorithm (see
Section 2.1). From these, they construct a graph of size Ω(α) to represent possible combinations of
string occurrences that can be combined to form occurrences of P.

Our algorithm similarly finds all of the occurrences of the strings of P in T. However, we show
how to avoid constructing a large graph representing the possible combinations of occurrences.
Instead we present a way to efficiently represent sufficient information to correctly find the occurrences
of P, leading to a significant space improvement from O(m + α) to O(m + A). Surprisingly, the
algorithm needed to achieve this space bound is very simple, and only requires maintaining a set
of sorted lists of disjoint intervals. Even though the algorithm is simple, the space bound achieved
by it is non-obvious. We give a careful analysis leading to the O(m + A) space bound.

Our space-efficient black-box solution for reporting the positions of the individual strings in P 
for each match of P in T is obtained by constructing the graph for overlapping chunks of T of 
size 2(m + B). Hence the solution is parametrized by the time and space complexity of the actual 
algorithm used to construct the graph. 

2 Algorithm 

In this section we present the algorithm. For completeness, we first briefly review the classical Aho- 
Corasick algorithm for multiple string matching in Section 2.1. We then define the central idea of 
relevant occurrences in Section 2.2. We present the full algorithm in Section 2.3 and analyze it in 
Section 3. 

2.1 Multi-String Matching 

Given a set of pattern strings 𝒫 = {P_1, ..., P_k} of total length m and a text T of length n the
multi-string matching problem is to report all occurrences of each pattern string in T. Aho and
Corasick [1] generalized the classical Knuth-Morris-Pratt algorithm [11] for single string matching
to multiple strings. The Aho-Corasick automaton (AC-automaton) for 𝒫, denoted AC(𝒫), consists
of the trie of the patterns in 𝒫. Hence, any path from the root of the trie to a state s corresponds
to a prefix of a pattern in 𝒫. We denote this prefix by path(s). For each state s there is also a
special failure transition pointing to the unique state s' such that path(s') is the longest prefix of
a pattern in 𝒫 matching a proper suffix of path(s). Note that for non-root states the depth of s' in
the trie is always strictly smaller than the depth of s.

Finally, for each state s we store the subset occ(s) ⊆ 𝒫 of patterns that match a suffix of path(s).
Since the patterns in occ(s) share suffixes we can represent occ(s) compactly by storing for s the
index of the longest string in occ(s) and a pointer to the state s' such that path(s') is the second
longest string, if any. In this way we can report occ(s) in O(|occ(s)|) time.

The maximum outdegree of any state is bounded by the number of leaves in the trie, which
is at most k. Hence, using a standard comparison-based balanced search tree to index the trie
transitions out of each state we can construct AC(𝒫) in O(m log k) time and O(m) space.

To find the occurrences of 𝒫 in T, we read the characters of T from left to right while traversing
AC(𝒫) to maintain the longest prefix of the strings in 𝒫 matching T. At a state s and character
c we proceed as follows. If c matches the label of a trie transition t from s, the next state is the
child endpoint of t. Otherwise, we recursively follow failure transitions from s until we find a state
s' with a trie transition t' labeled c. The next state is then the child endpoint of t'. If no such state
exists, the next state is the root of the trie. For each failure transition traversed in the algorithm
we must traverse at least as many trie transitions. Therefore, the total time to traverse AC(𝒫) and
report occurrences is O(n log k + α), where α is the total number of occurrences.

Hence, the Aho-Corasick algorithm solves multi-string matching in O((n + m) log k + α) time
and O(m) space.
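
The construction and traversal described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the comparison-based variant analysed in the text: trie transitions are stored in Python dictionaries (hashing), and the names `build_ac`, `ac_search` and the `report` array (the pointer to the nearest state on the failure chain that ends a pattern) are choices made for this example. Occurrences are returned as pairs (end position, pattern number), both 1-indexed; patterns are assumed non-empty.

```python
from collections import deque


def build_ac(patterns):
    """Build trie transitions, failure links and compact output pointers."""
    goto, fail, words = [{}], [0], [[]]
    for idx, p in enumerate(patterns):
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({})
                fail.append(0)
                words.append([])
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        words[s].append(idx)
    report = [0] * len(goto)          # 0 (the root) means "nothing to report"
    queue = deque(goto[0].values())   # depth-1 states fail to the root
    while queue:                      # breadth-first: shallower states first
        s = queue.popleft()
        report[s] = s if words[s] else report[fail[s]]
        for ch, child in goto[s].items():
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            queue.append(child)
    return goto, fail, words, report


def ac_search(text, patterns):
    """Report every occurrence as (end position, pattern number), 1-indexed."""
    goto, fail, words, report = build_ac(patterns)
    hits, s = [], 0
    for pos, ch in enumerate(text, start=1):
        while s and ch not in goto[s]:
            s = fail[s]               # follow failure transitions
        s = goto[s].get(ch, 0)
        t = report[s]
        while t:                      # walk only states that end a pattern
            hits.extend((pos, idx + 1) for idx in words[t])
            t = report[fail[t]]
    return hits


T = "ATCGGCTCCAGACCAGTACCCGTTCCGTGGT"
print(len(ac_search(T, ["A", "CC", "GT"])))
```

For the strings of Example 1 the sketch reports 14 occurrences, matching the count 5 + 5 + 4 given in Section 1.1.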
[Figure 1 (image not reproduced): occurrences of P_{i+1} marked "Relevant" or "Not relevant" against
the range from τ + a_i + 1 to τ + b_i + 1; horizontal axis: position in T.]

Figure 1: In this figure x is an occurrence of P_i in T reported at position τ. The first and last
occurrence of P_{i+1} start outside R(x), thereby violating the ith gap constraint, so these occurrences
are not relevant compared to x. The second occurrence y of P_{i+1} starts in R(x), so if x is itself
relevant, then y is also relevant.



2.2 Relevant Occurrences 

For a substring x of T, let startpos(x) and endpos(x) denote the start and end position of x in T,
respectively. Let x be an occurrence of P_i with τ = endpos(x) in T, and let R(x) denote the range
[τ + a_i + 1; τ + b_i + 1] in T.

An occurrence y of P_i in T is a relevant occurrence of P_i iff i = 1 or startpos(y) ∈ R(x), for
some relevant occurrence x of P_{i-1}. See Fig. 1 for an example. Relevant occurrences are similar to
the valid occurrences defined in [19]. The difference is that a valid occurrence is an occurrence of
P_{i+1} that is in R(x) for any occurrence x of P_i in T, i.e., x need not be a valid occurrence itself.

From the definition of relevant occurrences, it follows directly that we can solve the VLG
problem by finding the relevant occurrences of P_k in T. Specifically, we have the following result.

Lemma 1 Let S be a substring of T matching the VLG pattern S_1 · g{a_1, b_1} · S_2 · g{a_2, b_2} ··· S_k.
Then, startpos(S_{i+1}) ∈ R(S_i) for all i = 1, ..., k − 1.
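
The definition of relevant occurrences can be evaluated literally, layer by layer, as in the sketch below. It is a quadratic illustration of the definition and not the algorithm of Section 2.3; the names `end_positions`, `gap_range` and `relevant_by_definition` are hypothetical. The sketch assumes 1-indexed inclusive positions as in Example 1, so the start of an occurrence of P_i ending at τ is taken to be τ − |P_i| + 1.

```python
def end_positions(text, p):
    """1-indexed end positions of all occurrences of p in text."""
    return [s + len(p) for s in range(len(text) - len(p) + 1)
            if text.startswith(p, s)]


def gap_range(end, gap):
    """R(x) = [end + a + 1; end + b + 1] for an occurrence x ending at end."""
    a, b = gap
    return (end + a + 1, end + b + 1)


def relevant_by_definition(text, strings, gaps):
    """For each P_i, the end positions of its relevant occurrences."""
    relevant = [end_positions(text, strings[0])]   # every occurrence of P_1
    for i in range(1, len(strings)):
        ranges = [gap_range(e, gaps[i - 1]) for e in relevant[i - 1]]
        layer = []
        for e in end_positions(text, strings[i]):
            start = e - len(strings[i]) + 1        # assumed 1-indexed start
            if any(lo <= start <= hi for lo, hi in ranges):
                layer.append(e)
        relevant.append(layer)
    return relevant


T = "ATCGGCTCCAGACCAGTACCCGTTCCGTGGT"
print(relevant_by_definition(T, ["A", "CC", "GT"], [(6, 7), (2, 6)]))
```

The last layer contains exactly the end positions 17, 28 and 31 of Example 1.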



2.3 The Algorithm 

Algorithm 1 computes the relevant occurrences of P_k using the output from the AC automaton.
The idea behind the algorithm is to keep track of the ranges defined by the relevant occurrences
of each subpattern P_i, such that we can efficiently check if an occurrence of P_i is relevant or not.
More precisely, for each subpattern P_i, i = 2, ..., k, we maintain a sorted list L_i containing the
ranges defined by previously reported relevant occurrences of P_{i-1}. When an occurrence of P_i is
reported by the AC automaton, we can determine whether it is relevant by checking if it starts in
a range contained in L_i (step 2b). Initially, the lists L_2, L_3, ..., L_k are empty. When a relevant
occurrence of P_i is reported, we add the range defined by this new occurrence to the end of L_{i+1}.
In case the new range [s, t] overlaps or adjoins the last range [q, r] in L_{i+1} (s ≤ r + 1) we merge the
two ranges into a single range [q, t].

Let τ denote the current position in T. A range [a, b] ∈ L_i is dead at position τ iff b < τ − |P_i|.
When a range is dead no future occurrences y of P_i can start in that range since endpos(y) ≥ τ
implies startpos(y) ≥ τ − |P_i|. In Fig. 1 the range R(x) defined by x dies, when position u is reached.
Our algorithm repeatedly removes any dead ranges to limit the size of the lists L_2, L_3, ..., L_k. To
remove the dead ranges in step 2a we traverse the list and delete all dead ranges until we meet a
range that is not dead. Since the lists are sorted, all remaining ranges in the list are still alive. See
Fig. 2 for an example.
Algorithm 1 Algorithm solving the VLG problem for a VLG pattern P and a string T.

1. Build the AC-automaton for the subpatterns P_1, P_2, ..., P_k.

2. Process T using the automaton and each time an occurrence x of P_i is reported at position
τ = endpos(x) in T do:

(a) Remove any dead ranges from the lists L_i and L_{i+1}.

(b) If i = 1 or τ − |P_i| = startpos(x) is contained in the first range in L_i do:

i. If i < k: Append the range R(x) = [τ + a_i + 1; τ + b_i + 1] to the end of L_{i+1}. If the
range overlaps or adjoins the last range in L_{i+1}, the two ranges are merged into a
single range.

ii. If i = k: Report τ.
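
The steps of Algorithm 1 can be sketched in Python as follows. Several points are assumptions of this illustration rather than part of the paper: the occurrences are produced by a naive scan (`find_occurrences`) standing in for the AC automaton, the lists are `collections.deque` objects, and positions are 1-indexed and inclusive so that an occurrence of P_i ending at τ is taken to start at τ − |P_i| + 1, the convention under which the numbers of Example 1 and Fig. 2 are reproduced.

```python
from collections import deque


def find_occurrences(text, strings):
    """Stand-in for the AC automaton: (end, i) pairs, 1-indexed, ordered by end."""
    hits = []
    for i, p in enumerate(strings, start=1):
        for s in range(len(text) - len(p) + 1):
            if text.startswith(p, s):
                hits.append((s + len(p), i))
    return sorted(hits)


def vlg_end_positions(text, strings, gaps):
    """Report the end positions of the relevant occurrences of P_k."""
    k = len(strings)
    lists = {i: deque() for i in range(1, k + 2)}  # L_1 and L_{k+1} stay empty
    reported = []
    for end, i in find_occurrences(text, strings):
        # Step 2a: drop dead ranges from L_i and L_{i+1}.
        for j in (i, i + 1):
            if j <= k:
                earliest_start = end - len(strings[j - 1]) + 1
                while lists[j] and lists[j][0][1] < earliest_start:
                    lists[j].popleft()
        # Step 2b: x is relevant iff i = 1 or it starts in the first range of L_i.
        start = end - len(strings[i - 1]) + 1
        first = lists[i][0] if lists[i] else None
        if i > 1 and not (first and first[0] <= start <= first[1]):
            continue
        if i < k:
            # Step 2(b)i: append R(x), merging with the last range if they touch.
            a, b = gaps[i - 1]
            lo, hi = end + a + 1, end + b + 1
            nxt = lists[i + 1]
            if nxt and lo <= nxt[-1][1] + 1:
                nxt[-1][1] = hi
            else:
                nxt.append([lo, hi])
        else:
            # Step 2(b)ii: a relevant occurrence of P_k ends a match.
            reported.append(end)
    return reported


T = "ATCGGCTCCAGACCAGTACCCGTTCCGTGGT"
print(vlg_end_positions(T, ["A", "CC", "GT"], [(6, 7), (2, 6)]))
```

Only the first range of each list is ever inspected and only the last is ever extended, which is why a double-ended queue suffices for each L_i.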



3 Analysis 

We now show that Algorithm 1 solves the VLG problem in time O(n log k + m + α) and space
O(m + A), implying Theorem 1.

3.1 Correctness 

To show that Algorithm 1 finds exactly the relevant occurrences of P_k, we show by induction
on i that the algorithm in step 2b correctly determines the relevancy of all occurrences of P_i,
i = 1, 2, ..., k, in T.

Base case: All occurrences of P_1 are by definition relevant and Algorithm 1 correctly determines
this in step 2b.

Inductive step: Let y be an occurrence of P_i, i > 1, that is reported at position τ. There are two
cases to consider.

1. y is relevant. By definition there is a relevant occurrence x of P_{i-1} in T, such that
startpos(y) = τ − |P_i| ∈ R(x). By the induction hypothesis x was correctly determined
to be relevant by the algorithm. Since endpos(x) < τ, R(x) was appended to L_i earlier in
the execution of the algorithm. It remains to show that the range containing startpos(y)
is the first range in L_i in step 2b. When removing the dead ranges in L_i in step 2a, all
ranges [a, b] where b < τ − |P_i| are removed. Therefore the range containing τ − |P_i| =
startpos(y) is the first range in L_i after step 2a. It follows that the algorithm correctly
determines that y is relevant.

2. y is not relevant. Then there exists no relevant occurrence x of P_{i-1} such that startpos(y) ∈
R(x). By the induction hypothesis there is no range in L_i containing startpos(y), since
the algorithm only appends ranges when a relevant occurrence is found. Consequently,
the algorithm correctly determines that y is not relevant.
Figure 2: The occurrences of the subpatterns P_1 = A, P_2 = CC and P_3 = GT and the ranges they
define in the text T from Example 1. Occurrences which are not relevant are crossed out. The bold
occurrences of P_3 are the relevant occurrences of P_k and their end positions 17, 28 and 31 constitute
the solution to the VLG problem. Consider the point in the execution of the algorithm when the
occurrence x of P_2 at position τ = 26 is reported by the Aho-Corasick automaton. At this time
L_2 = [ [17; 20], [22; 23], [25; 26] ] and L_3 = [ [23; 28] ]. The ranges [17; 20] and [22; 23] are now dead
and are removed from L_2 in step 2a. In step 2b the algorithm determines that x is relevant and
R(x) = [29; 33] is appended to L_3: L_3 = [ [23; 33] ].



3.2 Time and Space Complexity 

The AC automaton for the subpatterns P_1, P_2, ..., P_k can be built in time O(m log k) using O(m)
space, where m = Σ_{i=1}^{k} |P_i|. In the trivial case when m > n we do not need to build the automaton.
Hence, we will assume that m ≤ n in the following analysis. For each of the α occurrences of the
strings P_1, P_2, ..., P_k Algorithm 1 first removes the dead ranges from L_i and L_{i+1} and performs a
number of constant-time operations. Since both lists are sorted, the dead ranges can be removed by
traversing the lists from the beginning. At most α ranges are ever added to the lists, and therefore
the algorithm spends O(α) time in total on removing dead ranges. The total time is therefore
O((n + m) log k + α) = O(n log k + m + α).

To prove the space bound, we first show the following lemma. 

Lemma 2 At any time during the execution of the algorithm we have

$$|L_i| \le \left\lfloor \frac{2c_{i-1} + |P_i| + a_{i-1}}{c_{i-1} + 1} \right\rfloor = O\!\left(\frac{|P_i| + a_{i-1}}{b_{i-1} - a_{i-1} + 2}\right)$$

for i = 2, 3, ..., k, where c_i = b_i − a_i + 1.



Proof. Consider list L_i for some i = 2, ..., k. Referring to Algorithm 1, the size of the list L_i
is only increased in step 2(b)i, when a range R(x_j) defined by a relevant occurrence x_j of P_{i-1} is
reported and R(x_j) does not adjoin or overlap the last range in L_i.

Let R(x_1) = [s, t] be the first range in L_i at an arbitrary time in the execution of the algorithm.
We bound the number of additional ranges that can be added to L_i from the time R(x_1) became the
first range in L_i until R(x_1) is removed. The last position where R(x_1) is still alive is τ_a = t + |P_i| − 1.
If a relevant occurrence x_ℓ of P_{i-1} ends at this position, then the range R(x_ℓ) = [τ_a + a_{i-1} + 1; τ_a +
b_{i-1} + 1] is appended to L_i. Hence, the maximum number of positions d from t to the end of R(x_ℓ)
is

$$\begin{aligned}
d &= \tau_a + b_{i-1} + 1 - t \\
  &= (t + |P_i| - 1) + b_{i-1} + 1 - t \\
  &= |P_i| + b_{i-1} \\
  &= |P_i| + a_{i-1} + c_{i-1} - 1 .
\end{aligned}$$

In the worst case, all the ranges in L_i are separated by exactly one position as illustrated in Fig. 3.
Therefore at most ⌊d/(c_{i-1} + 1)⌋ additional ranges can be added to L_i before R(x_1) is removed.
Counting in R(x_1) yields the following bound on the size of L_i:

$$|L_i| \le \left\lfloor \frac{d}{c_{i-1} + 1} \right\rfloor + 1 = \left\lfloor \frac{2c_{i-1} + |P_i| + a_{i-1}}{c_{i-1} + 1} \right\rfloor = O\!\left(\frac{|P_i| + a_{i-1}}{b_{i-1} - a_{i-1} + 2}\right). \qquad \Box$$

[Figure 3 (image not reproduced): timeline over positions in T marking where x_1 is reported and
R(x_1) is added to L_i, and the last position where R(x_1) is still alive.]

Figure 3: The worst-case situation where ℓ, the maximum number of ranges, are present in L_i. The
figure only shows the first and the last occurrence of P_{i-1} (x_1 and x_ℓ) defining the ℓ ranges.



By Lemma 2 the total number of ranges stored at any time during the processing of T is at
most

$$O\!\left(\sum_{i=2}^{k} \frac{|P_i| + a_{i-1}}{b_{i-1} - a_{i-1} + 2}\right) = O\!\left(\sum_{i=2}^{k} |P_i| + \sum_{i=1}^{k-1} a_i\right) = O(m + A).$$

Each range can be stored using O(1) space, so this is an upper bound on the space needed to
store the lists L_2, ..., L_k. The AC-automaton uses O(m) space, so the total space required by our
algorithm is O(m + A).

In summary, the algorithm uses O(n log k + m + α) time and O(m + A) space. This completes
the proof of Theorem 1.
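
As a quick sanity check of Lemma 2, the bound can be evaluated on the pattern of Example 1, where |P_2| = |P_3| = 2 and the gaps are g{6,7} and g{2,6}. The worked instance below is added for illustration only:

```latex
\begin{align*}
c_1 &= b_1 - a_1 + 1 = 7 - 6 + 1 = 2, &
|L_2| &\le \left\lfloor \frac{2 \cdot 2 + 2 + 6}{2 + 1} \right\rfloor
       = \left\lfloor \frac{12}{3} \right\rfloor = 4, \\
c_2 &= b_2 - a_2 + 1 = 6 - 2 + 1 = 5, &
|L_3| &\le \left\lfloor \frac{2 \cdot 5 + 2 + 2}{5 + 1} \right\rfloor
       = \left\lfloor \frac{14}{6} \right\rfloor = 2.
\end{align*}
```

This is consistent with Fig. 2, where L_2 holds three ranges and L_3 holds one range when position 26 is processed.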






4 Complete Characterization of Occurrences 



In this section we show how our algorithm can be extended to report not only the end position of
P_k, but also the positions of P_1, P_2, ..., P_{k-1} for each occurrence of P in T.

The main idea is to construct a graph that encodes all occurrences of the VLG-pattern using
O(α) space. For each occurrence of the VLG-pattern, the positions of the individual subpatterns
can be reported by traversing this graph. This approach was also used by Rahman et al. [19] and
Morgante et al. [13]. We give a fast new algorithm for constructing this graph and show a black-box
solution that can report the occurrences of the VLG-pattern without storing the complete graph.

We introduce the following simple definitions. If P occurs in the text T, then a match combination
is a sequence e_1, ..., e_k of end positions of P_1, ..., P_k in T corresponding to the match. The
total number of match combinations of P in T is denoted β. Note that there can be many match
combinations corresponding to a single match. See Fig. 4.

[Figure 4 (image not reproduced): the five match combinations aligned above the text
ATCGGCTCCAGACCAGTACCCGTTCCGTGGT, with positions 1 to 31 marked below it.]

Figure 4: The text sequence is the same as in the previous examples. The substring S from
position 5 to 17 (highlighted in bold) matches the VLG-pattern Q = G · g{0,3} · C · g{1,6} ·
A · g{2,7} · T. As the figure shows, this match contains the following five match combinations:
[5,9,12,17], [5,8,12,17], [5,8,10,17], [5,6,12,17], [5,6,10,17].

Due to the inequality of arithmetic and geometric means, the total number of match combinations β
is maximized when the α occurrences are distributed evenly over P_1, P_2, ..., P_k and each occurrence
of P_i is compatible with all occurrences of P_{i-1} for i = 2, ..., k. So in the worst case β = Θ((α/k)^k),
which is exponential in the number of gaps. All these match combinations can be encoded in a
directed graph using O(α²) space as follows. The nodes in the graph are the relevant occurrences
of P_1, P_2, ..., P_k in T. Two nodes x of P_{i-1} and y of P_i are connected by an edge from y to x if
and only if startpos(y) ∈ R(x). In that case we also say that x and y are compatible. We denote
this graph as the gap graph for P and T. See Fig. 5. Since the number of nodes in the gap graph
is at most α, and there are O((α/k)²) edges between the k layers in the worst case, we can store the
graph using O(α²/k) space.
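
The notions of compatibility and match combination can be illustrated with a sketch that builds the edges of the gap graph straight from the definition and walks them backwards from the occurrences of P_k. It is a quadratic illustration, not the construction of Section 4.1; for brevity it keeps edges between all occurrences, but only complete paths down to P_1 are output, so non-relevant occurrences never appear in a result. The names `match_combinations` and `end_positions` are hypothetical, and starts are computed as end − |P_i| + 1 (1-indexed positions).

```python
def end_positions(text, p):
    """1-indexed end positions of all occurrences of p in text."""
    return [s + len(p) for s in range(len(text) - len(p) + 1)
            if text.startswith(p, s)]


def match_combinations(text, strings, gaps):
    """All sequences [e_1, ..., e_k] of end positions forming a match."""
    k = len(strings)
    ends = [end_positions(text, p) for p in strings]

    def compatible(i, x_end, y_end):
        # x ends an occurrence of strings[i], y ends one of strings[i + 1].
        a, b = gaps[i]
        y_start = y_end - len(strings[i + 1]) + 1
        return x_end + a + 1 <= y_start <= x_end + b + 1

    # Edges of the gap graph, directed from y back to every compatible x.
    preds = {(i, y): [x for x in ends[i - 1] if compatible(i - 1, x, y)]
             for i in range(1, k) for y in ends[i]}

    combos = []

    def walk(i, y, suffix):
        if i == 0:
            combos.append([y] + suffix)
            return
        for x in preds[(i, y)]:
            walk(i - 1, x, [y] + suffix)

    for y in ends[k - 1]:
        walk(k - 1, y, [])
    return sorted(combos)


T = "ATCGGCTCCAGACCAGTACCCGTTCCGTGGT"
Q = (["G", "C", "A", "T"], [(0, 3), (1, 6), (2, 7)])
print([c for c in match_combinations(T, *Q) if c[0] == 5 and c[-1] == 17])
```

Restricted to the match from position 5 to 17, the sketch lists the five match combinations of Fig. 4.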
Figure 5: The gap graph for the VLG-pattern R = C · g{0,3} · G · g{3,10} · A and the text
T = CTGGCCCCGCTCCACGTTGAGCGGCGCTGAG.

If the j occurrences x_1, x_2, ..., x_j of P_i (appearing in that order in T) are all compatible with the
same occurrence y of P_{i+1}, then the j edges (y, x_1), (y, x_2), ..., (y, x_j) are all present in the gap
graph. Due to the following lemma, the edges (y, x_2), ..., (y, x_{j-1}) are redundant.

Lemma 3 Let x_1 and x_2 be two occurrences of P_i, i = 1, ..., k − 1, both compatible with the same
occurrence y of P_{i+1}. Assume without loss of generality that startpos(x_1) < startpos(x_2) and let x'
be another occurrence of P_i such that startpos(x_1) < startpos(x') < startpos(x_2), then x' is also
compatible with y.

Proof. Since startpos(y) ∈ R(x_1) and startpos(y) ∈ R(x_2), we have that startpos(y) ∈ R(x_1) ∩
R(x_2). Furthermore since startpos(x_1) < startpos(x') < startpos(x_2), it holds that R(x_1) ∩ R(x_2) ⊆
R(x'), so startpos(y) ∈ R(x'). □



Leaving out the redundant edges in the gap graph, we get a new graph, which we denote the implicit
gap graph. For an example, see Fig. 6. In this graph the out-degree of each node is at most two,
so the number of edges is now linear in the number of nodes, and consequently we can store the
implicit gap graph using O(α) space.




[Figure 6 (image not reproduced): the implicit gap graph drawn over the text, positions 1 to 31.]

Figure 6: The implicit gap graph for the VLG-pattern R = C · g{0,3} · G · g{3,10} · A and the text
T = CTGGCCCCGCTCCACGTTGAGCGGCGCTGAG. The out-degree of each node is at most
two. Compare to Fig. 5.

In the context of these new definitions, we are interested in solving the two following problems:

The reporting variable length gaps problem (RVLG problem) is to output all match combinations
of P in T.

The implicit reporting variable length gaps problem (IRVLG problem) is to output the implicit
gap graph of all match combinations of P in T.
[Figure 7 (image not reproduced): the text GACACACCTGGCATAGCCGA with the occurrences x_1, x_2, x_3
of P_1 and the ranges stored in the two lists; horizontal axis: position in T.]

Figure 7: Example showing how the two lists L_2^f and L_2^ℓ store the first and most recent range to
cover a position in the text, respectively. The VLG-pattern is AC · g{1,5} · T. When the occurrence
y of P_2 = T at position 9 is reported, we can check the two lists to see that x_1 is the first and x_3
is the last occurrence of P_1 compatible with y.



4.1 Constructing the Implicit Gap Graph 

Algorithm 2 describes how to build the implicit gap graph. Recall that in Algorithm 1 the ranges
in L_i allowed us to determine the relevancy of a newly reported occurrence x of P_i by inspecting
the first range in L_i (after the dead ranges had been removed). To build the implicit gap graph,
we need to not only determine the relevancy of x, but also the first and last occurrence of P_{i-1}
compatible with x. This information allows us to add the correct edges to the implicit gap graph.

To do this, we replace the list L_i with two lists L_i^f and L_i^ℓ, for i = 2, ..., k. The idea is that when
a position in the text is covered by multiple ranges, L_i^f contains the first range and L_i^ℓ contains the
most recent range to cover that position. See Fig. 7. Each range [s, t] in L_i^f or L_i^ℓ now also has a
reference to the occurrence x of P_{i-1} that defined it, and we will denote the range [s, t]_x to indicate
this. When an occurrence x of P_i is reported, we first remove dead ranges from the lists L_i^f, L_i^ℓ,
and L_{i+1}^ℓ as was done in Algorithm 1. If x is relevant, a node representing x is added to the
implicit gap graph in step 2(b)i. In step 2(b)ii, provided that x is not an occurrence of P_1, the two
out-going edges of x are added by inspecting L_i^f and L_i^ℓ to determine the first and last occurrence
of P_{i-1} compatible with x. Unless x is an occurrence of P_k, the range R(x) = [τ + a_i + 1; τ + b_i + 1]
is added to the lists L_{i+1}^f and L_{i+1}^ℓ in step 2(b)iii as described in the following section.

4.1.1 Maintaining the Range Lists 

When adding a range [s, t]_x defined by an occurrence x of P_i to L_{i+1}^f and L_{i+1}^ℓ, we simply append it
to the end of the list if it does not overlap the last range in the list. Otherwise, to avoid overlapping
ranges, we appropriately shorten either the newly added range [s, t]_x (for L_{i+1}^f) or the last range in
the list (for L_{i+1}^ℓ). The way L_i^f is maintained ensures that the first range that covers some position
τ in T will remain the only range covering this position in L_i^f. Conversely, L_i^ℓ will store the most
recent range covering τ. In Algorithm 2 the steps 2(b)iii A, B and C append and possibly shorten
the ranges according to this strategy.
Algorithm 2 Algorithm solving the IRVLG problem for a VLG pattern P and a string T.

1. Build the AC-automaton for the subpatterns P_1, P_2, ..., P_k.

2. Process T using the automaton and each time an occurrence x of P_i is reported at position
τ = endpos(x) in T do:

(a) Remove any dead ranges from the lists L_i^f, L_i^ℓ, and L_{i+1}^ℓ.

(b) If i = 1 or if τ − |P_i| = startpos(x) is contained in the first range in L_i^ℓ (i.e., x is a
relevant occurrence) do:

i. Add the node x to the implicit gap graph.

ii. If i > 1: Add the edges (x, y) and (x, z) to the implicit gap graph, where y and z
are the occurrences of P_{i-1} defining the first range in L_i^f and L_i^ℓ, respectively.

iii. If i < k: Let [q, r]_w and [q', r']_{w'} denote the last range in L_{i+1}^f and the last range in
L_{i+1}^ℓ, respectively.

A. Append the range [max(r + 1, τ + a_i + 1); τ + b_i + 1]_x to the end of L_{i+1}^f.

B. Change the last range in L_{i+1}^ℓ to [q'; min(r', τ + a_i)]_{w'}.

C. Append the range [τ + a_i + 1; τ + b_i + 1]_x to the end of L_{i+1}^ℓ.
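
The steps of Algorithm 2 can be sketched as follows. The sketch makes several assumptions that are not taken from the paper: occurrences come from a naive scan instead of the AC automaton, nodes are identified by the pair (i, end position), each node maps to the end positions of its first and last compatible occurrence of P_{i-1}, starts are computed as τ − |P_i| + 1 with 1-indexed positions, and for simplicity dead ranges are dropped from both lists of P_i and of P_{i+1}.

```python
from collections import deque


def implicit_gap_graph(text, strings, gaps):
    """Map each relevant node (i, end) to (first, last) compatible predecessor ends."""
    k = len(strings)
    occ = sorted((s + len(p), i) for i, p in enumerate(strings, start=1)
                 for s in range(len(text) - len(p) + 1) if text.startswith(p, s))
    first = {i: deque() for i in range(1, k + 2)}  # L_i^f, entries [lo, hi, owner]
    last = {i: deque() for i in range(1, k + 2)}   # L_i^l, entries [lo, hi, owner]
    graph = {}
    for end, i in occ:
        # Step 2a: remove dead ranges.
        for j in (i, i + 1):
            if j <= k:
                earliest_start = end - len(strings[j - 1]) + 1
                for lst in (first[j], last[j]):
                    while lst and lst[0][1] < earliest_start:
                        lst.popleft()
        # Step 2b: relevance test against the first live range.
        start = end - len(strings[i - 1]) + 1
        if i > 1 and not (first[i] and first[i][0][0] <= start <= first[i][0][1]):
            continue
        # Steps 2(b)i and 2(b)ii: add the node and its (at most two) edges.
        graph[(i, end)] = () if i == 1 else (first[i][0][2], last[i][0][2])
        if i < k:
            a, b = gaps[i - 1]
            lo, hi = end + a + 1, end + b + 1
            f, l = first[i + 1], last[i + 1]
            # Step A: in L^f keep earlier owners, so the new range is cut at the front.
            f.append([max(f[-1][1] + 1, lo) if f else lo, hi, end])
            # Step B: in L^l the newest owner wins, so the previous range is cut.
            if l:
                l[-1][1] = min(l[-1][1], lo - 1)
            # Step C: append the full new range to L^l.
            l.append([lo, hi, end])
    return graph


T = "ATCGGCTCCAGACCAGTACCCGTTCCGTGGT"
g = implicit_gap_graph(T, ["A", "CC", "GT"], [(6, 7), (2, 6)])
print(g[(3, 28)])
```

For Example 1 the occurrence of GT ending at 28 is linked to the occurrences of CC ending at 20 and 21, and on the instance of Fig. 7 the occurrence of T at position 9 is linked to the first and third occurrence of AC.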



4.1.2 Time and Space Analysis 

As for Algorithm 1, the time spent for each of the at most α relevant occurrences reported by the AC
automaton is amortized constant. Hence the implicit gap graph can be built in O(n log k + m + α)
time. Storing the implicit gap graph for the entire text takes space O(α), since each of the at most
α nodes has at most two out-going edges.

We now consider the space needed to store the lists L_i^f and L_i^ℓ. The ranges in L_i^f and L_i^ℓ are no
longer guaranteed to have size c_{i-1} = b_{i-1} − a_{i-1} + 1, nor to be separated by at least one position,
so the bound of Lemma 2 needs to be revised, resulting in a slightly increased space bound for
storing the lists. Referring to Fig. 3, the number of ranges in L_i^f or L_i^ℓ at any point in time is at
most

d + 1 = c_{i-1} + |P_i| − 1 + a_{i-1} + 1 = |P_i| + b_{i-1} + 1 .

Summing up, the total space required to store the lists increases from O(m + A) to O(m + B),
where B = Σ_{i=1}^{k-1} b_i is the sum of the upper bounds of the lengths of the gaps.
Recapitulating, we have the following theorem.

Theorem 2 The IRVLG problem can be solved in time O(n log k + m + α) and space O(m + B + α).

4.2 A Black-Box Solution for Reporting Match Combinations 

The number of match combinations, $\beta$, can be exponential in the number of gaps. The implicit
gap graph encodes all of these match combinations space-efficiently in a graph of size $O(\alpha)$. Thus,
a straightforward solution to the RVLG problem is to construct the implicit gap graph and
subsequently traverse it to report the match combinations. Each of the $\beta$ match combinations is a
sequence of $k$ integers, so this solution to the RVLG problem takes time $O(n \log k + m + \alpha + k\beta)$
and space $O(m + B + \alpha)$.
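
The traversal step can be illustrated with a short Python sketch. The representation is an assumption made for the example: the nodes of each string are kept in increasing order of end position, so the two edges of a node delimit a contiguous block of predecessors, and each edge pair is stored as a pair of indices. The name `match_combinations` is hypothetical, and the sketch favors brevity over meeting the $O(k\beta)$ bound exactly.

```python
def match_combinations(levels, edges):
    """Yield every match combination as a tuple of k end positions.

    levels[i] lists the end positions of the relevant occurrences of the
    (i+1)-th string in increasing order. For i >= 1, edges[i][j] = (lo, hi)
    holds the indices in levels[i-1] of the first and last occurrence that
    node j of level i can follow. edges[0] is unused.
    """
    k = len(levels)

    def extend(i, j):
        # All partial combinations ending with node j of level i.
        if i == 0:
            yield (levels[0][j],)
            return
        lo, hi = edges[i][j]
        for p in range(lo, hi + 1):
            for prefix in extend(i - 1, p):
                yield prefix + (levels[i][j],)

    # Every node of the last level is the end of at least one match.
    for j in range(len(levels[k - 1])):
        yield from extend(k - 1, j)
```

With `levels = [[2, 3], [6, 7], [10]]` and `edges = [None, [(0, 1), (1, 1)], [(0, 1)]]`, five nodes and three edge pairs encode three combinations, and the gap between graph size and output size widens as more strings are added.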
We now show that the RVLG problem can be solved space-efficiently using any black-box
algorithm for the IRVLG problem. The main idea is a simple splitting of T into overlapping smaller 
substrings of suitable size. We solve the problem for each substring individually and combine the 
solutions to solve the full problem. By carefully organizing the computation we can efficiently reuse 
the space needed for the subproblems. 

Let $A_I$ be any algorithm that solves the IRVLG problem in time $t(n, m, k, \alpha)$ and space
$s(n, m, k)$, where $n$, $m$, $k$, and $\alpha$ are the parameters of the input as above. We build a new
algorithm $A_R$ from $A_I$ that solves the RVLG problem as follows. Assume without loss of generality
that $n$ is a multiple of $2(m + B)$. Divide $T$ into $z = \frac{n}{m+B} - 1$ substrings $C_1, \ldots, C_z$, called chunks.
Each chunk has length $2(m + B)$ and overlaps in $m + B$ characters with each neighbor. We run
$A_I$ on each chunk $C_1, \ldots, C_z$ in sequence to compute the implicit gap graph for each chunk. By
traversing the implicit gap graph for each chunk we output the union of the corresponding match
combinations. Since each match combination of $P$ in $T$ occurs in at most two neighboring chunks,
it suffices to store the implicit gap graphs for only two chunks at any time.
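
The chunk layout can be made concrete with a small Python sketch. It assumes 0-indexed, half-open positions and writes `w` for $m + B$. The helper `home_chunk` is not part of the text: it is one hypothetical way to make sure a match that lies in two neighboring chunks is reported only once when the union is formed.

```python
def chunk_bounds(n, w):
    """Return half-open (start, end) bounds of the chunks of a text of length n.

    w stands for m + B. Chunks have length 2w and neighbors overlap in w
    characters, so there are z = n // w - 1 chunks.
    """
    assert n % (2 * w) == 0, "n is assumed to be a multiple of 2(m + B)"
    z = n // w - 1
    return [(j * w, j * w + 2 * w) for j in range(z)]


def home_chunk(start, n, w):
    """Index of the single chunk chosen to report a match starting at start.

    A match spans at most w = m + B characters, so it fits entirely in the
    chunk whose first half contains its start (or in the last chunk).
    """
    z = n // w - 1
    return min(start // w, z - 1)
```

For `n = 12` and `w = 3` this gives the three chunks `(0, 6)`, `(3, 9)`, and `(6, 12)`.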

Next we consider the complexity of $A_R$. Let $\alpha_i$ denote the number of occurrences of the strings
of $P$ in $C_i$. For each chunk we run $A_I$ to produce the implicit gap graph. Given these we compute
the union of match combinations in $O(k\beta)$ time. Hence, algorithm $A_R$ uses time

\[ O\left( \sum_{i=1}^{z} t(2(m + B), m, k, \alpha_i) + k\beta \right) . \]

Next consider the space. We only need to store the implicit gap graphs for two chunks at any time.
Since the space required for each chunk is $O((m + B)k)$, the total space becomes

\[ O\left( (m + B)k + s(2(m + B), m, k) \right) . \]

The black-box algorithm efficiently converts algorithms for the IRVLG problem into algorithms for
the RVLG problem, resulting in the following theorem.

Theorem 3 Given an algorithm solving the IRVLG problem in time $t(n, m, k, \alpha)$ and space $s(n, m, k)$,
there is an algorithm solving the RVLG problem in time $O\left(\sum_{i=1}^{z} t(2(m + B), m, k, \alpha_i) + k\beta\right)$ and
space $O((m + B)k + s(2(m + B), m, k))$.

If we use the result from Theorem 2, we obtain an algorithm that uses time

\[ O\left( \sum_{i=1}^{z} t(2(m + B), m, k, \alpha_i) + k\beta \right) = O\left( \frac{n}{m + B} \bigl( 2(m + B) \log k + m \bigr) + \alpha + k\beta \right) = O(n \log k + m + \alpha + k\beta) , \]

where the term $m$ in the last expression is needed for the case where $m > n$. The space usage is

\[ O\left( (m + B)k + s(2(m + B), m, k) \right) = O\left( (m + B)k + m + B + \max_{i=1,\ldots,z} \alpha_i \right) = O((m + B)k) , \]

where the last equality holds, since $\alpha_i \le (m + B)k$ for all $i$. In summary, we have the following
result for the RVLG problem.

Theorem 4 The RVLG problem can be solved in time $O(n \log k + m + \alpha + k\beta)$ and space
$O((m + B)k)$.
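
The time bound of Theorem 4 can be unpacked step by step. The derivation below is a sketch that applies the bound of Theorem 2 to each chunk of length $2(m + B)$ and relies on two observations: the number of chunks is less than $\frac{n}{m+B}$, and every occurrence of a string of $P$ lies in at most two chunks, so the per-chunk occurrence counts sum to at most twice the total number of occurrences.

```latex
\begin{align*}
\sum_{i=1}^{z} t\bigl(2(m+B), m, k, \alpha_i\bigr)
  &= \sum_{i=1}^{z} O\bigl(2(m+B)\log k + m + \alpha_i\bigr) \\
  &= O\Bigl(z\,\bigl(2(m+B)\log k + m\bigr) + \sum_{i=1}^{z}\alpha_i\Bigr) \\
  &= O\Bigl(\tfrac{n}{m+B}\,\bigl(2(m+B)\log k + m\bigr) + \alpha\Bigr) \\
  &= O(n\log k + n + \alpha) .
\end{align*}
```

The last step uses $\frac{n}{m+B} \cdot m \le n$. For $k \ge 2$ the term $n$ is dominated by $n \log k$, and adding the $O(k\beta)$ cost of the traversals gives the stated bound.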
4.3 Reporting Match Combinations On the Fly



We now show how a simple extension of our algorithm provides an alternative solution to the RVLG
problem achieving the same space and time complexity as the black-box solution. The idea is to
use Algorithm 2 and report the match combinations on the fly, while continually removing nodes
from the implicit gap graph that can no longer be part of a match combination. We remove the
nodes using a method similar to that for removing dead ranges in the lists $L^f_i$ and $L^\ell_i$.

We say that a node $x$ of $P_i$ in the implicit gap graph is dead if $x$ cannot be part of a future
match combination. This happens when

\[ t > \mathrm{endpos}(x) + \sum_{j=i+1}^{k} \left( b_{j-1} + |P_j| \right) . \]

Like dead ranges, we can remove dead nodes from the implicit gap graph in amortized constant
time. Consequently, all match combinations can be reported in time $O(n \log k + m + \alpha + k\beta)$.
Removing the dead nodes ensures that the number of $P_i$ nodes in the implicit gap graph at any
time is at most $1 + \sum_{j=i+1}^{k} (b_{j-1} + |P_j|)$. Thus, the total number of nodes never exceeds

\[ \sum_{i=1}^{k} \left( 1 + \sum_{j=i+1}^{k} \left( b_{j-1} + |P_j| \right) \right) = O((m + B)k) . \]
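
The dead-node test lends itself to a short Python sketch. It reads $t$ as the current position in $T$, which is an assumption about notation, and it precomputes the suffix sums once so that each test takes constant time. The names `tail_sums`, `is_dead`, and `prune`, and the 1-indexed lists with an unused slot at index 0, are choices made for the example only.

```python
from collections import deque


def tail_sums(b, plen):
    """Return tail with tail[i] = sum over j = i+1..k of (b[j-1] + plen[j]).

    plen[i] is |P_i| for i = 1..k and b[i] is the upper bound of gap i for
    i = 1..k-1. Index 0 of both lists is unused padding.
    """
    k = len(plen) - 1
    tail = [0] * (k + 1)
    for i in range(k - 1, 0, -1):
        tail[i] = tail[i + 1] + b[i] + plen[i + 1]
    return tail


def is_dead(t, endpos, i, tail):
    """True if a node of P_i ending at endpos cannot join a future match."""
    return t > endpos + tail[i]


def prune(nodes, t, i, tail):
    """Drop dead nodes from the front of a deque of end positions of P_i."""
    while nodes and is_dead(t, nodes[0], i, tail):
        nodes.popleft()
```

Because nodes of the same string die in order of end position, pruning only ever touches the front of each deque, which is what makes the removal amortized constant per node.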

In summary, the algorithm solves the RVLG problem in time $O(n \log k + m + \alpha + k\beta)$ and space
$O((m + B)k)$, so it provides an alternative proof of Theorem 4.